Spec 038 Phase 3 — Class-A rules, correctness fixes, EC3 results, template typo fix by codemonkeychris · Pull Request #250 · microsoft/microsoft-ui-reactor

codemonkeychris · 2026-05-12T03:54:16Z

Summary

Phase 3 (Tier-3 rules): three new Class-A induced rules driven by the cross-agent reproducibility audit (gpt-5.5 + claude-sonnet-4.6 × 525-run corpora each) — GridSizeFactoryParensRule (CS1955, 146 events combined, top freq in both corpora, first cross-tier rule), GridSizePxRenameRule (CS0117, 9 events), TextBlockStyleHintRule (CS1061/CS0117, 5 events across two syntactic shapes). ThemeBackgroundSuffixRule reclassified Class-B → Class-A in the file-header comment.
Two critical correctness fixes uncovered by end-to-end smoke testing — both were silently no-op'ing every rule firing in production despite all unit tests passing:
1. CompilationLoader now resolves ProjectReference outputs from project.assets.json's libraries.<id> entries with type=project. Without this, Reactor itself (a project reference for every sample app) was invisible to RuleSymbolResolver and every rule's DeclaredTargets failed — the whole registry self-disabled on every real invocation.
2. Suggest-gate carve-out for Tier-3 rules. SuggesterOrchestrator takes tier2Enabled: bool; Tier-3 rules always run when their diagnostic code surfaces, Tier-2 stays gated. EC2 watch-item ("Phase-3 rules are the right lever — not Phase-2.x gate tuning") finally addressed in code.
Template-identity typo fix (Micrsoft.UI.Reactor.CSharp → Microsoft.UI.Reactor.CSharp in template.json) — was breaking dotnet new reactorapp resolution against accumulating template caches; hit 20/20 runs across both arms of the EC3-original batch.
Template-shape cleanup for agent reasoning: removed <ImplicitUsings>enable</ImplicitUsings> from the scaffolded csproj; explicit using directives (System + Microsoft.UI.Reactor + .Core + .Layout + Xaml + Xaml.Controls + static Factories) baked into App.cs. Every symbol now traces to a visible using at the top of the file. The starter App.cs only uses three of the seven imports — the rest are there for the namespaces the agent reaches for within the first few turns of any real app.
Skill streamlining: SKILL.md (top-level + reactor-getting-started plugin copy) gains the anti-probe + mur check pointer paragraphs that the EC3 trace analysis identified as load-bearing. reactor-getting-started Tier-1 trims (509 → 415 lines, −18%) — dropped the single-file dotnet run minimal-app block, the standalone csproj xml, the Mode-detection section duplicated by top-level SKILL.md, the App-entry-point section, and the package-cache directory tree. No load-bearing content removed; all five cuts have breadcrumb pointers to where the displaced content lives.
Cross-agent audit + EC3 results + reference doc. Audit at docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md closes Data Checkpoint C's reproducibility bar. Reference doc docs/reference/mur-check-did-you-mean.md expanded to cover Phase 2 + 3, the cross-agent mining methodology, the gate carve-out, and the ProjectReference fix. EC3-original (PASS-with-caveats) + EC3-final (clean PASS) results recorded in docs/specs/tasks/038-mur-check-did-you-mean-implementation.md.
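The suggest-gate carve-out in item 2 above can be pictured with a minimal sketch. This is not the SuggesterOrchestrator code: `Suggestion`, `collect_suggestions`, the rule tuples, and the Tier-2 code set are hypothetical names invented for illustration. Only the behavior comes from the summary: Tier-3 rules run whenever their diagnostic code surfaces, and Tier-2 stays behind the gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    tier: int   # 2 = gated fuzzy match, 3 = authored rule
    code: str   # compiler diagnostic code, e.g. "CS1955"
    text: str


def collect_suggestions(diagnostic_codes, tier2_enabled, tier3_rules, tier2_codes):
    """Tier-3 rules always run for their code; Tier-2 runs only when the gate is open."""
    out = []
    for code in diagnostic_codes:
        # Tier 3: never gated (the carve-out).
        for rule_code, hint in tier3_rules:
            if rule_code == code:
                out.append(Suggestion(3, code, hint))
        # Tier 2: still controlled by the suggest-gate result.
        if tier2_enabled and code in tier2_codes:
            out.append(Suggestion(2, code, "fuzzy did-you-mean"))
    return out


# Hypothetical inputs: two authored rules and an assumed Tier-2 code set.
TIER3_RULES = [
    ("CS1955", "GridSize.Auto() -> GridSize.Auto"),
    ("CS0117", "GridSize.Pixel -> GridSize.Px"),
]
TIER2_CODES = {"CS0117", "CS1061"}

gate_closed = collect_suggestions(["CS1955", "CS0117"], False, TIER3_RULES, TIER2_CODES)
gate_open = collect_suggestions(["CS1955", "CS0117"], True, TIER3_RULES, TIER2_CODES)
print([s.tier for s in gate_closed], [s.tier for s in gate_open])
```

With the gate closed only the two Tier-3 suggestions appear; opening it adds the Tier-2 suggestion without changing what the rules emit.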

Phase 3 V1 ship verdict

EC3-final clean PASS landed 2026-05-12 — supersedes the EC3-original PASS-with-caveats verdict captured under that batch's contaminated-substrate run. Clean batch on eval/spec-038-ec3-2026-05-11 HEAD against the existing n=5 baseline:

Metric	calc-variant (n=5)	kanban-variant (n=5)
Tokens mean (Δ vs base)	195,477 (−33.7%)	387,236 (−21.2%)
Tokens median (Δ vs base)	180,040 (−37.1%)	400,466 (+31.7%, see below)
Tokens CV	28.4%	19.5% (vs base 74%)
Cost mean USD (Δ vs base)	$1.92 (−25.6%)	$3.12 (−25.7%)
Turns mean (Δ vs base)	6.4 (−2.2)	10.4 (0)
First-build OK	5/5	5/5
`failedToolCalls`	0	0
Template / cache failures	0 / 0 (one auto-recovered retry)	0 / 0

The kanban-median +31.7% delta is a distribution-tightening story, not a regression story. Base kanban distribution was 263K–1,118K tokens (CV 74%) and bimodal: most runs sat near the floor, while one r1 blowout dragged the mean up and left the median artificially low at 304K. Variant kanban is 261K–464K (CV 19.5%), with no fat tail and every run within 1.8× of best. The load-bearing finding is the roughly 4× CV improvement (74% → 19.5%), which is the predictability-as-a-feature signal the spec §11 risk row called out as deployable-workflow value (separate from any token-mean win). This is the second batch in a row (after EC1-RR) where this mechanism shows up, and the first batch where calc also tightens.
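The headline figures in the table and paragraph above can be cross-checked with simple arithmetic. The sketch below backs out the implied baseline values from the variant numbers and their percentage deltas. It uses only figures quoted in this PR; `baseline_from_delta` is a helper name introduced here, and because the published percentages are rounded, the recovered baselines are approximate.

```python
def baseline_from_delta(variant: float, pct_delta: float) -> float:
    """Invert variant = base * (1 + pct_delta / 100) to recover the base value."""
    return variant / (1 + pct_delta / 100)


# Variant figures and percentage deltas copied from the results table above.
calc_base_tokens = baseline_from_delta(195_477, -33.7)
kanban_base_tokens = baseline_from_delta(387_236, -21.2)
kanban_base_median = baseline_from_delta(400_466, +31.7)   # about 304K

# Per-run cost saving implied by the cost row (base mean minus variant mean).
calc_saving = baseline_from_delta(1.92, -25.6) - 1.92      # about $0.66
kanban_saving = baseline_from_delta(3.12, -25.7) - 3.12    # about $1.08

# Spread figures quoted in the paragraph above.
cv_ratio = 74 / 19.5          # about 3.8, i.e. "roughly 4x"
variant_spread = 464 / 261    # about 1.78, i.e. "within 1.8x of best"

print(f"kanban base median ~ {kanban_base_median:,.0f} tokens")
print(f"cost saving: calc ${calc_saving:.2f}, kanban ${kanban_saving:.2f}")
print(f"CV ratio {cv_ratio:.1f}x, variant spread {variant_spread:.2f}x")
```

The recovered kanban base median lands near the 304K quoted above, which is why a positive median delta can coexist with a lower mean.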

All four pass criteria cleared:

#	Criterion	Result
1	Tokens improve ≥ 5% on at least one arm	Pass — both arms
2	First-build OK ≥ 5/5 on both variants	Pass (5/5, 5/5)
3	No false-positive rule fires	Pass with low confidence — `failedToolCalls` 0/0; §11 guardrail retrofit (post-run `mur check --final` audit) still deferred for high-confidence assertion
4	CV ≤ EC1-RR	Pass (kanban 19.5% vs EC1-RR 54%)

Full results table at docs/specs/tasks/038-mur-check-did-you-mean-implementation.md § "EC3-final results — 5×N landed 2026-05-12".

One footnote worth recording

EC3-original measured 0/10 firings on the three new Class-A rules (GridSizeFactoryParensRule / GridSizePxRenameRule / TextBlockStyleHintRule). EC3-final doesn't break out per-rule counts, so we can't say whether the clean-PASS win includes any contribution from those three rules or whether it's entirely the structural fixes + template + skill changes carrying the result. The clean PASS supersedes the EC3-original verdict regardless — the rules are correct in isolation, pass Validation Gate bars #1–#4 + #6, and don't actively harm when silent. But "Phase 3 V1 shipped on Class-A rules that may not have fired in production-ish eval" is a footnote worth recording for whoever picks up this work next. The targeted-prompt batch at C:\temp\mur-targeted-prompt-spec.md is the load-bearing follow-up for empirical token-impact numbers on the three Class-A rules specifically.

Test plan

dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -c Debug -p:Platform=x64 — 7179 passing / 46 expected skips
dotnet test tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj with CreateTemplateTests filter — 2/2 passing on the corrected template identity
mur check --list-rules shows all six rules enabled with zero self-disables against samples/apps/wordpuzzle
Wordpuzzle end-to-end smoke at default --suggest-threshold 3: inject GridSize.Pixel(80) + GridSize.Auto() → both rules fire with full evidence suffixes (gate carve-out verified live)
mur pack-local against branch HEAD — Microsoft.UI.Reactor.0.0.0-local.nupkg carries the corrected template identity, explicit-usings App.cs, no implicit usings, and the trimmed agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md
Workstation ~/.templateengine cache drained of stale Micrsoft.UI.Reactor.CSharp entries and reinstalled clean
EC3-final clean PASS landed 2026-05-12 — superseded the EC3-original PASS-with-caveats verdict
Targeted-prompt batch for empirical Class-A rule validation — follow-up, doesn't block this PR's ship
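The wordpuzzle smoke step above injects `GridSize.Pixel(80)` and `GridSize.Auto()` and expects both rules to fire. As a rough picture of the rewrites those two rules suggest, here is a text-level sketch. It is an illustration only: the real rules are symbol-bound and fire on compiler diagnostics (CS0117, CS1955), whereas this sketch uses regular expressions, and `suggest_fix` is a hypothetical helper name.

```python
import re

# Legacy member names that the PR says map to GridSize.Px.
_PX_ALIASES = re.compile(r"\bGridSize\.(?:Pixels|Pixel|Fixed)\b")
# GridSize.Auto invoked like a method; Auto is the example the PR gives.
_AUTO_CALL = re.compile(r"\bGridSize\.Auto\(\s*\)")


def suggest_fix(line: str) -> str:
    """Return the line with both suggested rewrites applied (text-level only)."""
    line = _PX_ALIASES.sub("GridSize.Px", line)
    return _AUTO_CALL.sub("GridSize.Auto", line)


print(suggest_fix("new[] { GridSize.Pixel(80), GridSize.Auto() }"))
```

A purely textual match like this would misfire on unrelated types that happen to be named GridSize, which is the kind of false positive a symbol-binding contract is meant to rule out.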

Surface area

New rule files: src/Reactor.Cli/Check/Rules/{GridSizeFactoryParens,GridSizePxRename,TextBlockStyleHint}Rule.cs
Modified: src/Reactor.Cli/Check/{CompilationLoader,SuggesterOrchestrator,CheckCommand}.cs, src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs
New tests: rule fixture pairs for the three new rules + RulePerformanceTests.cs (§3.1a perf bound) + TemplateMetadataTests.cs (typo regression)
Modified tests: CompilationLoaderTests.cs, SuggesterOrchestratorRuleTests.cs
Template: tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json (typo fix), Company.ReactorApp1.csproj (drop ImplicitUsings), App.cs (explicit usings)
Skills: top-level SKILL.md, plugins/reactor/skills/reactor-getting-started/SKILL.md (Tier-1 trims, anti-probe note, mur check pointer, canonical-usings sync)
Docs: docs/reference/mur-check-did-you-mean.md (expanded through Phase 2 + 3 + cross-agent methodology); docs/specs/tasks/038-mur-check-did-you-mean-implementation.md (status snapshot, EC3-original + EC3-final results, cross-agent audit verdicts); new docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md
CHANGELOG entries under ## [Unreleased]
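The §3.1a perf bound listed under new tests is described in the commit history as a median of at most 0.5 ms per rule per diagnostic, times 4 for CI slack. A median-based timing check of that shape might look like the sketch below. It assumes the slack multiplies the budget (so an effective 2 ms ceiling), which is one reading of the commit message; `median_ms` and `within_budget` are hypothetical names, not the test's actual helpers.

```python
import statistics
import time

PER_RULE_BUDGET_MS = 0.5   # budget quoted in the commit message
CI_SLACK = 4               # assumed to multiply the budget on CI machines


def median_ms(fn, runs: int = 200) -> float:
    """Median wall-clock time of fn() in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def within_budget(fn, runs: int = 200) -> bool:
    """True when the median cost of fn() stays under budget times slack."""
    return median_ms(fn, runs) <= PER_RULE_BUDGET_MS * CI_SLACK


print(within_budget(lambda: None))
```

Using the median rather than the mean keeps a single slow sample (a GC pause, a cold cache) from failing the bound.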

🤖 Generated with Claude Code

Three new Class-A induced rules, motivated by the cross-agent audit at docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md:
- GridSizeFactoryParensRule (CS1955; 146 events combined gpt-5.5 + sonnet-4.6; first cross-tier rule since CS1955 is outside Tier-2 SupportedCodes): GridSize.Auto() -> GridSize.Auto (drop the parens).
- GridSizePxRenameRule (CS0117; 9 cross-agent events): GridSize.Pixel / Pixels / Fixed -> GridSize.Px (WPF/WinUI legacy name -> Reactor's Px).
- TextBlockStyleHintRule (CS1061/CS0117; 5 cross-agent events across both .Style(...) and `with { Style = ... }` shapes): hint toward Reactor's fluent text helpers since the element exposes no Style.
ThemeBackgroundSuffixRule reclassified Class-B -> Class-A (paperwork only; cross-agent audit shows 27 events on the same key).
Two critical correctness fixes uncovered by end-to-end smoke testing; both blocked any real-world rule firing before this commit:
1. CompilationLoader.ResolveReferences now walks libraries.<id> entries with type=project in project.assets.json and locates the most-recently-built matching .dll under that project's bin/ tree. Without this, every rule's DeclaredTargets failed to resolve and the whole registry self-disabled on real mur check invocations (unit tests passed because they use synthetic in-memory compilations). Regression locked by CompilationLoaderTests.Resolves_ProjectReference_built_dll_from_project_assets_json.
2. SuggesterOrchestrator gains a tier2Enabled bool; CheckCommand.Run always builds the orchestrator (when the compilation loads) and passes the suggest-gate result in as tier2Enabled. Tier-3 rules always run when their diagnostic code surfaces; Tier-2 stays gated on small builds where its fuzzy match has near-0% precision (525-run calibration). This is the EC2 watch-item ("Phase-3 rules are the right lever — not Phase-2.x gate tuning") finally addressed in code. Two new orchestrator tests lock down both halves of the carve-out.
§3.1a per-rule performance bound test landed (was deferred until first rule shipped): RulePerformanceTests.BestMatch_median_under_per_rule_budget asserts symbol-resolution + TryMatch median <= 0.5 ms per rule per diagnostic, times 4 for CI slack.
Status snapshot in the implementation tasks doc updated to record the sonnet-4.6 corpus aggregation (368 fixes / 564 ranker rows / 41 clusters), the cross-agent audit verdicts (3 STRONG Class-A targets, plus TemplatedListView family that's STRONG-after-generalization-over-<T>, plus the gpt-5.5-only CS1955/GridElement family deferred to a third corpus drop), and the rule-PR queue with this commit's three Class-A rules marked authored.
Branch is for spec 038 EC3 eval; see C:\temp\mur-ec3-handoff.md.
Full Reactor.Tests suite: 7175 passing / 46 expected skips.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
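The ProjectReference fix described in this commit (walk the `libraries` entries with `type=project`, then pick the most recently built matching .dll under the referenced project's bin/ tree) can be sketched as follows. This is not CompilationLoader's code. The function name is hypothetical, and the sketch assumes that each entry's `path` is relative to the consuming project directory and that the assembly name equals the library id before the slash; either assumption may not hold in every project.

```python
import json
from pathlib import Path


def resolve_project_reference_dlls(assets_path, project_dir):
    """Return the newest built .dll for each type=project library in project.assets.json."""
    assets = json.loads(Path(assets_path).read_text(encoding="utf-8"))
    resolved = []
    for lib_id, lib in assets.get("libraries", {}).items():
        if lib.get("type") != "project":
            continue  # package references are resolved elsewhere
        name = lib_id.split("/", 1)[0]  # "Reactor/1.0.0" -> "Reactor"
        # Assumption: "path" points at the referenced .csproj, relative to project_dir.
        ref_dir = (Path(project_dir) / lib["path"]).resolve().parent
        bin_dir = ref_dir / "bin"
        if not bin_dir.is_dir():
            continue  # referenced project has not been built yet
        candidates = list(bin_dir.rglob(f"{name}.dll"))
        if candidates:
            # Several configurations may exist; take the most recently written one.
            resolved.append(max(candidates, key=lambda p: p.stat().st_mtime))
    return resolved
```

Picking by modification time mirrors the "most-recently-built" wording above; a stricter resolver could instead match the configuration and target framework of the consuming build.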

The reference doc was scoped to Phase 0+1 plus the suggest-gate. This update extends it to cover everything shipping since:
- Phase 2 (merged): MSBuild passthrough via `--`, mode flags (--strict/--final/--quiet/--emit-threshold), the deterministic pre-emit policy table, and the suppress-to-error guardrail tool.
- Phase 3 (in flight on this branch): the IRulePattern infrastructure, RuleSymbolResolver / RuleRegistry, --disable-rule + --list-rules CLI surface, six authored rules (three Class-A induced + three Class-B vocabulary), the symbol-binding contract from §3.1a, and the per-rule perf bound test.
- Two critical correctness fixes uncovered during Phase 3 end-to-end smoke testing: CompilationLoader's ProjectReference resolution path and the suggest-gate carve-out for Tier-3 rules. Both get their own subsections in §3 explaining why unit tests passed while production silently no-op'd, since that failure mode generalises beyond this spec.
- The cross-agent mining drop (`claude-sonnet-4.6` × 525 runs) and the audit it produced. New subsection in §4 on comparing models to separate structural vocabulary-confusion signals from agent-specific idiosyncrasies; new subsection in §5 on what the second-agent corpus changed (B->A promotions, single-corpus deferrals, cross-syntactic-shape rule emergence).
§9 (Future improvements) tightened to what's actually left: remainder of Phase 3 (more rules pending a third-agent corpus + Class-B catalog expansion), Phase 4 (telemetry + learned ranker, blocked on Data Checkpoint D), and a "what EC3 will tell us" subsection that frames EC3 as a fresh measurement rather than an incremental delta on EC2.
Glossary gains: rule carve-out, pre-emit ranker, symbol-binding contract, ProjectReference resolution, cross-agent audit, provenance.
TOC updated for the §8 rename ("in this PR" -> "so far").
Tone matches the existing doc: plain language first, then engineer detail, then ML-practitioner detail, the same explanatory pattern spec 038's design doc uses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PASS-with-caveats. Calc cleared the >=5% token-improvement bar (-5.2% mean, -13.0% median); kanban regressed (+14.9% mean, +60.7% median).
The three Class-A rules added in this branch fired zero times across all 10 variant runs, so the EC3 delta did not exercise them. The calc improvement is plausibly driven by the CompilationLoader + gate carve-out fixes letting rules run at all, not by the new rules.
Tool-call profile diff identifies the +3.2 turn delta on kanban: the variant agent does ~+1 skill load, +1 view, +1 apply_patch per run vs base, consistent with a "verify-before-edit" loop triggered by rule suggestions. Mechanism cited in handoff section 7.
Recommend: do not declare Phase 3 cleared on this batch alone. Re-run with prompts that target GridSize/TextBlock patterns to get Class-A rule exercise; investigate the kanban-base R1 outlier (1.12M tokens, 3.4x median) before reading the kanban regression as decisive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…edit framing
Per-call inspection of variant kanban rg patterns and view paths doesn't support the earlier "verify-before-edit" hypothesis: ~11/12 rg calls probe drag/drop and modifier APIs unrelated to mur output; view calls are mostly the agent re-reading its own in-progress workspace files.
The two rule-fired runs (r1=Theme, r4=Align) are middle-of-the-pack on turns and tools, not the heaviest. The variant mean is dragged up by r5 (20 turns, 27 tool calls, 889K tokens, zero rule fires), which looks like a generic long-tail trajectory comparable to base R1.
Reframing: rule fires correlate with normal token usage when they happen; mur check can't help on builds where the agent's mistakes fall outside the rule set's coverage. The kanban-prompt -> rule-coverage gap is the underlying issue, not rule-induced verification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single-character bug in tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json lines 5-6: the template's identity and groupIdentity were "Micrsoft.UI.Reactor.CSharp" / "Micrsoft.UI.Reactor" (missing the first 'o'). Checked into the repo since at least Phase 1.
How it surfaced. The eval harness runs `dotnet new install ... --force` on every setup. Multiple installs accumulate duplicate entries in the user's ~/.templateengine/dotnetcli/<sdk>/templatecache.json under the misspelled identity. The duplicate-match condition makes `dotnet new reactorapp` resolve more than one template for the "reactorapp" short name, throwing "Sequence contains more than one matching element" with exit code 70. The EC3 5x2 batch hit this 20/20 runs across both arms; the spec doc's earlier framing ("agent typo, at least one variant kanban run, didn't block the build") was wrong on three counts and is corrected in this commit.
Why the existing integration test (CreateTemplateTests) didn't catch the typo: it installs the template into a per-test ephemeral --debug:custom-hive, where the misspelled identity is the only entry and `dotnet new` resolves correctly. The bug only surfaces against the user's real (accumulating) cache. The new test (described below) is content validation, not install/run behavior: orthogonal coverage that catches the typo regardless of cache state.
Test added: tests/Reactor.Tests/TemplateMetadataTests.cs. Four xUnit [Fact]s that load template.json directly:
- Identity_is_canonical_brand_namespace: exact-match assertion against "Microsoft.UI.Reactor.CSharp".
- GroupIdentity_is_canonical_brand_namespace: exact-match against "Microsoft.UI.Reactor".
- File_contains_no_brand_typos: substring sweep for "Micrsoft" anywhere in the file (belt-and-suspenders catch for future typos in any new symbol/description/etc.).
- ShortName_resolves_to_reactorapp: anchors the public CLI command name documented in SKILL.md and the wordpuzzle smoke pattern.
Workstation cache drained + reinstalled: `dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplates` repeated until empty, then `mur pack-local` repacked against the fixed template, then `dotnet new install` reinstalled. ~/.templateengine cache now carries exactly one canonical "Microsoft.UI.Reactor.CSharp" entry across both SDK versions on disk (10.0.104, 10.0.203).
Existing tests unaffected: Reactor.Tests 7179 passing / 46 expected skips (up from 7175, +4 from the new template-metadata tests). CreateTemplateTests integration smoke (`dotnet new reactorapp` + build + run + UI Automation find) passes 2/2 with the corrected identity.
EC3 verdict implication: both arms hit the typo equally, so the relative deltas (calc -5.2%, kanban +14.9%) are not biased *by this bug*. Absolute costs are inflated on every run; the long-tail outliers (variant r5 = 889K tokens, base r1 = 1.12M tokens) likely had their trajectories pushed further by `dotnet-new` thrash. The PASS-with-caveats verdict still stands directionally; a re-run with the typo fixed could materially shift the numbers in either direction. Spec doc updated to reflect this.
Two harness-side mitigations deferred to separate follow-ups (the source typo is the load-bearing fix; without it the harness mitigations would still leak):
1. `dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplates` before `dotnet new install --force` in eval setup, so future typo-equivalent bugs can't accumulate.
2. Propagate inner-command exit codes into the PowerShell tool wrapper's `success` field so `failedToolCalls` stops lying about dotnet-new failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
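The content-validation idea behind TemplateMetadataTests (check the file's text and fields directly instead of installing the template) can be sketched in a few lines. The real tests are xUnit facts in C#; this Python version is only an illustration, and `template_metadata_problems` is a hypothetical helper. The expected values are the ones quoted in this commit, and the sketch assumes `shortName` may be either a string or a list of strings.

```python
import json

CANONICAL_IDENTITY = "Microsoft.UI.Reactor.CSharp"
CANONICAL_GROUP = "Microsoft.UI.Reactor"
EXPECTED_SHORT_NAME = "reactorapp"
KNOWN_TYPO = "Micrsoft"


def template_metadata_problems(raw_json: str) -> list:
    """Return human-readable problems found in a template.json payload."""
    meta = json.loads(raw_json)
    problems = []
    if meta.get("identity") != CANONICAL_IDENTITY:
        problems.append(f"identity is {meta.get('identity')!r}")
    if meta.get("groupIdentity") != CANONICAL_GROUP:
        problems.append(f"groupIdentity is {meta.get('groupIdentity')!r}")
    # Substring sweep over the raw text catches the typo in any field.
    if KNOWN_TYPO in raw_json:
        problems.append(f"file contains {KNOWN_TYPO!r}")
    short = meta.get("shortName")
    short_names = [short] if isinstance(short, str) else list(short or [])
    if EXPECTED_SHORT_NAME not in short_names:
        problems.append(f"shortName is {short!r}")
    return problems


fixed = json.dumps({
    "identity": "Microsoft.UI.Reactor.CSharp",
    "groupIdentity": "Microsoft.UI.Reactor",
    "shortName": "reactorapp",
})
print(template_metadata_problems(fixed))
```

Because the check never touches the template cache, it gives the same answer on a clean hive and on a workstation with accumulated duplicate entries.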

EC3-post-typo-fix smoke trace analysis identified two reads following the scaffold: `view App.cs` (essential: the file the agent is about to apply_patch) and `view <project>.csproj` (defensive: the agent checking that the scaffold produced a sane csproj). The .csproj read is informational at best: the scaffold's stdout already showed the file listing plus "Restore succeeded.", and a calc/kanban-shaped task never modifies the .csproj.
Across the prior 10 variant runs, calc averaged 2.2 views/run and kanban averaged 2.2 (r5's 4 reads pulling the kanban mean up). Two views post-scaffold is the modal pattern, so a one-line skill note landing on the defensive read should compress noticeably.
Added the same one-line note in two places so both skill consumers see it:
- plugins/reactor/skills/reactor-getting-started/SKILL.md right after the canonical .csproj block (line ~102, next to the WindowsPackageType / UseWinUI MUST rules).
- SKILL.md (top-level, packed into the nupkg) right after the matching csproj block in the Project Setup section.
The wording explicitly carves out App.cs as still-necessary so the note doesn't suppress useful reads.
Estimated savings: one view + a few hundred tokens per scaffold step. Small per-run, real across the batch since every eval scaffolds.
Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled agentkit/SKILL.md carries the update.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…anti-probe + mur check pointer
Three related changes addressing the post-scaffold agent-confusion pattern the EC3-post-typo-fix trace analysis identified:
1. tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj removes <ImplicitUsings>enable</ImplicitUsings> and the three <Using Include="..."> items. Reactor's namespaces are now explicitly `using`-imported at the top of App.cs:
    using Microsoft.UI.Reactor;
    using Microsoft.UI.Reactor.Core;
    using static Microsoft.UI.Reactor.Factories;
Why: with implicit usings on, the source file looks like it's missing namespace context. `VStack`, `Heading`, `Component` appear unqualified without a visible `using`, which confuses agents reasoning about where symbols come from. The agent has to read the csproj to find the global Using items, then mentally merge them into App.cs's namespace scope. Explicit usings make App.cs self-contained: every symbol's source is one of the three using directives at the top of the file. The skill text now says "App.cs has its own using directives at the top, which is the only place you add new namespaces", which is true after this change.
2. SKILL.md + plugins/reactor/skills/reactor-getting-started/SKILL.md expand the existing "trust the scaffolded .csproj" note into an anti-probe paragraph that enumerates the exact post-scaffold file list: "the workspace contains exactly two source files: App.cs (entry point + initial component) and <Name>.csproj. There is no Program.cs and no GlobalUsings.cs; modify App.cs in place."
Why: the eval orchestrator's trace analysis identified a recurring "agent probes for files that don't exist" pattern (sometimes asking for Program.cs, sometimes inspecting obj/GlobalUsings.g.cs). Pinning the file list in the skill is a one-paragraph fix.
3. Same two SKILL.md files add a 1-paragraph mur check pointer alongside the anti-probe note: "Verify your edits with mur check before declaring done... For anything more involved than the build/fix loop (strict-mode failures, custom diagnostic gating, MSBuild passthrough flags), load the reactor-build-and-check skill."
Why: the deeper reactor-build-and-check skill is a heavy load (full --strict / --final / --quiet / --emit-threshold / --suppress-error surface plus the iter/final framing). Most agent runs just need the basic loop. Promoting mur check into getting-started with a one-liner for the basic case lets the agent stay in the lighter skill until it actually hits advanced behavior.
Verified: dotnet new reactorapp -n X builds clean in both the top-level-program default and the --use-program-main true variant. Existing CreateTemplateTests integration smoke (2/2) and TemplateMetadataTests unit tests (4/4) pass. mur pack-local refreshes both nupkgs against the new template + skill content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The prior commit (94d563f) added three usings to App.cs (Microsoft.UI.Reactor, .Core, static Factories) so the source is self-contained after dropping <ImplicitUsings>. The skill's "Required imports" section documents the full *canonical* set as five-plus-one, adding Microsoft.UI.Reactor.Layout, Microsoft.UI.Xaml, and Microsoft.UI.Xaml.Controls to the minimum three. The template and the skill had therefore diverged: the agent reading App.cs would see three usings while the skill text says the canonical set is six.
Sync the template to the skill: App.cs now ships all five-plus-one using directives, with the same `// FlexDirection, FlexJustify, ...` inline comments the skill uses for each non-obvious namespace. The starter App.cs still only uses three of them (Reactor, Core, static Factories); the other three are there because the agent will reach for them within the first ~5 turns of any real app (alignment enums, InfoBarSeverity, FlexDirection).
Updated the SKILL.md anti-probe paragraph in both copies to point at `using System.Linq;` as the example of "when you add a new namespace, add it to App.cs's using block". System.Linq is a real common add and isn't in the canonical six, so the example stays accurate. The top-level SKILL.md also explicitly names the canonical set so readers can cross-reference without flipping to the imports section.
Verified: dotnet new reactorapp -n X builds clean in the default variant. CreateTemplateTests integration smoke 2/2 and TemplateMetadataTests 4/4 pass against the expanded usings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

With ImplicitUsings disabled (template commit 94d563f) the agent doesn't get System auto-imported. Common BCL surface (Action, Func, EventArgs, DateTime, Math, TimeSpan, Random) all lives there, and it shows up within the first few turns of any non-trivial app (event handlers, timers, randomization, formatting). Adding `using System;` to the template's App.cs eliminates the "Action does not exist in the current context" miss that's otherwise the first thing the agent hits when it authors an event handler.
Synced the canonical set in three places so they stay coherent:
- tools/Templates/templates/WinUIApp-CSharp/App.cs (scaffold output)
- plugins/reactor/skills/reactor-getting-started/SKILL.md "Required imports" code block
- SKILL.md anti-probe note's parenthetical canonical-set list
Verified: scaffolded App.cs ships `using System;` at the top of the canonical seven-line using block; default-variant build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five surgical removals identified by the analysis pass:
A. "Minimal app — single file" block (36 lines → 5). The single-file `dotnet run App.cs` flow is a side path now that `dotnet new reactorapp` is the primary entry; the canonical-shape teaching moved into the scaffolded App.cs. Kept a 1-paragraph pointer to reactor-build-and-check's single-file-scripts section for the demo case.
B. Standalone `.csproj` xml block (17 lines dropped). The xml taught the agent how to write a csproj from scratch, but the agent doesn't author one; `dotnet new reactorapp` produces it. Kept the "when to use a .csproj" framing + the WindowsPackageType / UseWinUI MUST-rules + the recently-added anti-probe + mur check paragraphs.
C. "Mode detection — selfhost vs. NuGet consumer" section (29 lines → 2). The top-level SKILL.md already owns selfhost/consumer bootstrap; re-explaining it here was a second copy. The new one-paragraph "Bootstrap" section breadcrumb-points readers to SKILL.md and keeps the load-bearing `mur pack-local` recovery tip inline.
D. "App entry point" section (13 lines → 0). The ReactorApp.Run<App> form is already in the scaffolded App.cs. The unique content was the inline-render-function form `ReactorApp.Run("T", ctx => ...)`, now embedded as a one-line addendum to §Components instead of carrying a whole section for it.
E. "Where the skill content comes from" package-cache directory tree (6 lines dropped). The literal `%USERPROFILE%\.nuget\...` block was reference material an agent can `find` on demand. Kept the plugin-channel framing + the api-index pointer + the "read once, cache in working memory" tip.
What's preserved unchanged:
- The React→Reactor table (highest-value block in the file)
- Components / Hooks / Common factories / Theme tokens / Critical gotchas (load-bearing reference content)
- The recent anti-probe + mur check paragraphs
- The trimmed sections still carry their breadcrumb pointers so agents looking for the removed content find their way to the right skill (reactor-build-and-check, top-level SKILL.md, etc.)
Tier 2 (move drag-and-drop to reactor-input, trim Context, drop duplicate List/UseReducer callout) and Tier 3 (move ContentDialog + Flyout to reactor-recipes) are follow-up considerations, not applied in this commit; they want eval validation before landing.
Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md carries the trimmed file (verified: nupkg copy is 415 lines, matches).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Append the EC3-final results subsection to the implementation tasks doc; mark EC3-original as superseded (preserved as the historical record of the typo-contaminated batch and the PASS-with-caveats reasoning that drove the watch-item triage work).
EC3-final headline numbers (5×N landed 2026-05-12 on eval/spec-038-ec3-2026-05-11 @ 053afe9):
calc: tokens −33.7% mean / −37.1% median, cost −25.6%, turns −2.2, CV 28.4%, first-build 5/5
kanban: tokens −21.2% mean (median +31.7% is the distribution-tightening artifact, not a regression: base CV 74% bimodal vs variant CV 19.5% no-fat-tail), cost −25.7%, turns 0, first-build 5/5
The roughly 4× kanban CV improvement is the load-bearing finding: second batch in a row (after EC1-RR) where the predictability-as-a-feature signal shows up, first batch where calc also tightens.
All four EC3 pass criteria cleared. Spec §12's "~−$0.70 per run" prediction is roughly met on calc ($0.66) and comfortably exceeded on kanban ($1.08). Spec EC3 row's "~−2 turns" prediction is hit on calc (−2.2).
One unresolved footnote: per-rule firing counts weren't broken out in this batch. EC3-original was 0/10 on the three new Class-A rules; this clean PASS may be carried entirely by the structural fixes + template + skill changes with the three new rules still inert. The verdict supersedes EC3-original regardless (rules are correct in isolation, pass bars #1-#4 + #6, don't actively harm when silent), but the targeted-prompt batch at C:\temp\mur-targeted-prompt-spec.md remains the load-bearing follow-up for getting empirical token-impact numbers on those three rules specifically.
Watch-items carried into V1 / Phase 4 review:
- Class-A rule exercise via targeted-prompt batch
- §11 risk-row guardrail retrofit (post-run mur check --final audit)
- Tier-2 SKILL.md trims (now empirically de-risked)
- rule_fired trace event addition
Verdict: PASS, clean. Phase 3 V1 cleared to ship. PR #250 updated with the same verdict in its body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR advances mur check’s “did-you-mean” engine (spec 038 Phase 3) by adding three new Class‑A Tier‑3 rules, fixing two production-blocking correctness gaps (ProjectReference reference resolution + Tier‑3 suggest-gate carve-out), and repairing/streamlining the dotnet new reactorapp template and associated skill/docs/test coverage.

Changes:

Add three new Tier‑3 Class‑A induced rules (GridSize parens, GridSize Px rename, TextBlock Style hint) plus fixture tests and a perf-bound test.
Fix real-world rule execution by resolving ProjectReference outputs from project.assets.json and by gating Tier‑2 only (Tier‑3 rules always run when their codes surface).
Fix template identity typo and adjust the template shape (drop implicit/global usings; add explicit using block in App.cs), plus skill/docs/changelog updates.

| File | Description |
| --- | --- |
| tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj | Removes implicit/global usings from the scaffolded project. |
| tools/Templates/templates/WinUIApp-CSharp/App.cs | Adds explicit `using` directives for the scaffolded starter app. |
| tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json | Fixes template `identity`/`groupIdentity` typo (`Micrsoft` → `Microsoft`). |
| tests/Reactor.Tests/TemplateMetadataTests.cs | Adds unit tests guarding template metadata/branding invariants. |
| tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorRuleTests.cs | Adds tests asserting Tier‑3 rules still fire when Tier‑2 is suggest-gated off. |
| tests/Reactor.Tests/CheckCommandTests/Rules/TextBlockStyleHintRuleTests.cs | Fixture tests for `TextBlockStyleHintRule` (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs | Adds perf bound test for per-rule `BestMatch` cost. |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizePxRenameRuleTests.cs | Fixture tests for `GridSizePxRenameRule` (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizeFactoryParensRuleTests.cs | Fixture tests for `GridSizeFactoryParensRule` (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs | Adds regression test for resolving ProjectReference-built DLLs via assets.json. |
| src/Reactor.Cli/Check/SuggesterOrchestrator.cs | Introduces `tier2Enabled` gating (Tier‑2 only) while always allowing rules. |
| src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs | Updates rule header docs to reflect Class‑A evidence/reclassification. |
| src/Reactor.Cli/Check/Rules/TextBlockStyleHintRule.cs | New Tier‑3 rule for missing `TextBlockElement.Style` patterns. |
| src/Reactor.Cli/Check/Rules/GridSizePxRenameRule.cs | New Tier‑3 rule mapping legacy `Pixel/Pixels/Fixed` → `Px`. |
| src/Reactor.Cli/Check/Rules/GridSizeFactoryParensRule.cs | New Tier‑3 rule for `GridSize.<property>()` CS1955 parens removal. |
| src/Reactor.Cli/Check/CompilationLoader.cs | Resolves ProjectReference outputs by scanning `libraries` entries in assets.json. |
| src/Reactor.Cli/Check/CheckCommand.cs | Loads compilation once and wires `tier2Enabled` through the orchestrator. |
| skills/reactor.api.txt | Updates API index content (new surfaced APIs). |
| SKILL.md | Updates top-level skill guidance (anti-probe + `mur check` workflow notes). |
| plugins/reactor/skills/reactor-getting-started/SKILL.md | Trims/reshapes getting-started skill and synchronizes scaffold/import guidance. |
| plugins/reactor/skills/reactor-dsl/references/reactor.api.txt | Updates packaged API index copy. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md | Adds cross-agent reproducibility audit writeup. |
| docs/specs/tasks/038-mur-check-did-you-mean-implementation.md | Updates spec task status/results narrative through EC3 findings. |
| docs/reference/mur-check-did-you-mean.md | Expands reference doc to cover Phase 2–3 behavior and fixes. |
| CHANGELOG.md | Records new rules and correctness fixes under Unreleased. |

Copilot's findings

Files reviewed: 25/25 changed files
Comments generated: 2

…on non-suggestable builds

Copilot review surfaced two substantive issues; both fixed.

(1) RulePerformanceTests CombinedStub only carried targets for the three earlier Class-B rules. With RuleRegistry.Default now including the three new Class-A rules (GridSizeFactoryParens, GridSizePxRename, TextBlockStyleHint), those three were silently self-disabling during the perf test: TargetsResolve failed against the stub's missing GridSize and TextBlockElement types. The 'budget = 0.5ms × ruleCount' assertion then scaled by registry.All.Length (six) while only measuring three rules' actual cost, so the bound was 2× loose.

Fix: extended the stub with Microsoft.UI.Reactor.GridSize (record struct with Auto/Star/Px matching the real shape) and Microsoft.UI.Reactor.Core.TextBlockElement (record). Added a stub-coverage guard at the top of the perf test that asserts every rule in RuleRegistry.Default.All resolves its declared targets against the test compilation; it fails loudly with the missing target name and rule name if someone adds a new rule without updating the stub. Future-proofs the budget assertion.

(2) CheckCommand.Run unconditionally called CompilationLoader.Instance.Load(path) after the EC3 gate carve-out refactor, even when no parsed diagnostic could plausibly produce a suggestion (no diagnostics at all; only Tier-2 codes with the gate closed and no rule covering them; only nullable/XML-doc warnings). The compilation load is 50–500 ms cold: .cs enumeration, file-set hash, full reference resolution including the new ProjectReference walk. Paying it on every clean mur check was a wall-time regression on the happy path.

Fix: added SuggesterOrchestrator.AnyDiagnosticIsSuggestable(diags, tier2Enabled, rules), a flat scan over the (small) diagnostic list against the union of Tier-2's SupportedCodes and every rule's DiagnosticCodes. Microseconds. CheckCommand.Run now gates the compilation load behind that pre-check: it only loads when at least one diagnostic could plausibly produce a suggestion.

Test coverage:
- RulePerformanceTests: stub-coverage guard asserts every DeclaredTarget across RuleRegistry.Default.All resolves.
- SuggesterOrchestratorRuleTests gains 5 new facts:
  * empty diag list → false (clean build skips load)
  * unrelated CS warnings (CS8602/CS8618) → false
  * CS1061 + tier2Enabled=true → true
  * CS1061 + tier2Enabled=false + no rule → false (gate-closed Tier-2-only path is non-suggestable)
  * CS1955 covered by rule + tier2Enabled=false → true (Tier-3 always runs)

Verified:
- Reactor.Tests 7184 passing / 46 expected skips (was 7179, +5).
- CreateTemplateTests integration smoke 2/2.
- Clean wordpuzzle mur check exits with no output (pre-check short-circuits, no compilation load).
- Wordpuzzle with GridSize.Pixel(80) + GridSize.Auto() injected: both rules still fire under the default gate with full evidence suffixes. Pre-check correctly identifies the build as suggestable; nothing regressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
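
The stub-coverage guard described in the commit above boils down to "every declared target of every registered rule must exist in the test stub". A simplified sketch of that check follows. It models rules and stub types as plain strings instead of using the real `RuleRegistry` or a Roslyn compilation, so every name in it is a hypothetical stand-in rather than the PR's test code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the guard at the top of the perf test.
static class StubCoverageGuardSketch
{
    // Returns "rule -> target" for every declared target the stub does not define.
    // An empty result means the budget assertion is measuring every rule.
    public static List<string> MissingTargets(
        IReadOnlyDictionary<string, string[]> declaredTargetsByRule,
        ISet<string> typesInStub)
    {
        return declaredTargetsByRule
            .SelectMany(kv => kv.Value
                .Where(target => !typesInStub.Contains(target))
                .Select(target => $"{kv.Key} -> {target}"))
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToList();
    }
}
```

A test would assert that the returned list is empty and print its contents on failure, which gives the "missing target name and rule name" message the commit mentions.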

codemonkeychris · 2026-05-12T12:10:43Z

Both Copilot CR comments addressed in e4e7c7d.

1. RulePerformanceTests stub coverage (line 35). Confirmed real — the stub carried targets for the three earlier Class-B rules but not for the new GridSize / TextBlockElement Class-A rules, so three rules were silently self-disabling and the budget = 0.5ms × registry.All.Length assertion was scaling by 6 while only measuring 3 rules' actual cost (2× loose bound).

Fix: extended CombinedStub with Microsoft.UI.Reactor.GridSize (record struct, Auto/Star/Px) and Microsoft.UI.Reactor.Core.TextBlockElement (record). Added a stub-coverage guard at the top of the perf test that asserts every rule in RuleRegistry.Default.All resolves its declared targets against the test compilation — fails loudly with the missing target name + rule name if someone adds a new rule without updating the stub. Future-proofs the budget assertion against the next Class-A wave.

2. CheckCommand.Run unconditional compilation load (line 158). Confirmed real — after the EC3 gate carve-out refactor, the compilation load happens on every invocation including clean builds, builds where only nullable/XML-doc warnings surfaced, and builds where Tier-2 is gated and no rule covers the codes. 50–500ms wall-time regression on the happy path.

Fix: added SuggesterOrchestrator.AnyDiagnosticIsSuggestable(diags, tier2Enabled, rules) — a flat scan over the (small) diag list against the union of Tier-2's SupportedCodes and every rule's DiagnosticCodes. Microseconds. CheckCommand.Run gates the compilation load behind that pre-check. Five new tests in SuggesterOrchestratorRuleTests cover the truth table:

empty diag list → false (clean build skips load)
unrelated CS warnings (CS8602/CS8618) → false
CS1061 with tier2Enabled=true → true
CS1061 with tier2Enabled=false and no rule covering → false (gate-closed Tier-2-only path is non-suggestable)
CS1955 covered by a rule with tier2Enabled=false → true (Tier-3 always runs regardless of the gate)
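
The truth table above can be reproduced with a small sketch of the pre-check. This is an illustration under stated assumptions, not the code in `SuggesterOrchestrator`: diagnostics are reduced to their code strings, `RuleSketch` is a hypothetical stand-in for a rule, and the Tier-2 code set shown is assumed for the example (the source only confirms that CS1061 is a Tier-2 code).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a rule: only the codes it listens for matter here.
sealed record RuleSketch(string Name, IReadOnlyCollection<string> DiagnosticCodes);

static class SuggestablePreCheckSketch
{
    // Assumed Tier-2 code set for illustration; the real SupportedCodes may differ.
    static readonly HashSet<string> Tier2SupportedCodes = new() { "CS1061", "CS0117" };

    public static bool AnyDiagnosticIsSuggestable(
        IEnumerable<string> diagnosticCodes,
        bool tier2Enabled,
        IReadOnlyCollection<RuleSketch> rules)
    {
        // Rule codes always count; Tier-2 codes count only when the gate is open.
        var suggestable = new HashSet<string>(rules.SelectMany(r => r.DiagnosticCodes));
        if (tier2Enabled)
        {
            suggestable.UnionWith(Tier2SupportedCodes);
        }

        // Flat scan: true as soon as one diagnostic could produce a suggestion.
        return diagnosticCodes.Any(code => suggestable.Contains(code));
    }
}
```

A caller would run this before the expensive compilation load and skip the load entirely when it returns false, which is the happy-path saving the comment describes.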

Verification. Reactor.Tests 7184/46 (was 7179, +5 for the new pre-flight facts). CreateTemplateTests integration smoke 2/2. End-to-end against samples/apps/wordpuzzle: clean build (no diagnostics) exits with no output and skips the compilation load; build with GridSize.Pixel(80) + GridSize.Auto() injected fires both rules with full evidence suffixes under the default gate — confirms the pre-check correctly classifies the build as suggestable and nothing regressed on the rule firing path.

Use the modern Windows TitleBar (drag region, system menu, themed caption) as the top-of-window element and wrap content in a Border with 24px padding. Apply the same polish to the `mur --create` scaffolder so both entry points produce a presentable starter app. Align the scaffolder's emitted usings with the dotnet new template (PR #250) so generated apps have the common WinUI/Reactor namespaces ready to go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codemonkeychris and others added 11 commits May 11, 2026 18:41

codemonkeychris requested a review from Copilot May 12, 2026 11:43

Copilot started reviewing on behalf of codemonkeychris May 12, 2026 11:47

Copilot AI reviewed May 12, 2026

Comment thread tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs Outdated

Comment thread src/Reactor.Cli/Check/CheckCommand.cs

codemonkeychris merged commit 9e0b012 into main May 12, 2026
7 checks passed

codemonkeychris deleted the eval/spec-038-ec3-2026-05-11 branch May 12, 2026 12:36

This was referenced May 12, 2026

Spec 038 — Phase 4 cleanup: targeted-prompt batch, guardrail retrofit, Checkpoint D, ranker training #252

Open

Spec 038 — rule_fired trace event for Tier-3 rule fires #251

Merged

nmetulev mentioned this pull request May 13, 2026

Polish default template: TitleBar + content padding #258

Closed

codemonkeychris merged 12 commits into main from eval/spec-038-ec3-2026-05-11
